We consider the problem of clustering a set of high-dimensional data points into sets of
low-dimensional linear subspaces. The number of subspaces, their dimensions, and their
orientations are unknown. We propose a simple and low-complexity clustering algorithm based
on thresholding the correlations between the data points followed by spectral clustering. A
probabilistic performance analysis shows that this algorithm succeeds even when the subspaces
intersect, and when the dimensions of the subspaces scale (up to a log-factor) linearly in the
ambient dimension. Moreover, we prove that the algorithm also succeeds for data points that
are subject to erasures with the number of erasures scaling (up to a log-factor) linearly in the
ambient dimension. Finally, we propose a simple scheme that provably detects outliers.

1 Introduction



Suppose we are given a set $\mathcal{X}$ of $N$ data points in $\mathbb{R}^m$, and assume that
$\mathcal{X} = \mathcal{X}_1 \cup \ldots \cup \mathcal{X}_L \cup \mathcal{O}$, where the points in
$\mathcal{X}_l$ lie in a (low-dimensional) linear subspace $S_l$ of $\mathbb{R}^m$, and $\mathcal{O}$
denotes a set of outliers. The association of the data points with the sets $\mathcal{X}_l$ and
$\mathcal{O}$, the number of subspaces $L$, their dimensions $d_l$, and their orientations are all
unknown. We consider the problem of clustering the data points, i.e., of finding the assignments of
the points in $\mathcal{X}$ to the sets $\mathcal{X}_l$ and $\mathcal{O}$. Note that once these
associations have been identified, it is straightforward to extract the subspaces $S_l$ through
principal component analysis (PCA). The problem we consider is known as subspace clustering and has
applications in, e.g., unsupervised learning, image processing, disease detection, and, in
particular, computer vision; see, e.g., [1] and references therein. Numerous approaches to
subspace clustering are known. We refer to [1] for an excellent overview.

Spectral clustering (SC) methods (see [2] for an introduction) have found particularly widespread
use. At the heart of SC lies the construction of an adjacency matrix
$\mathbf{A} \in \mathbb{R}^{N \times N}$, with the $(i, j)$th entry of $\mathbf{A}$ measuring the
similarity between the data points $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{X}$. A typical similarity
measure is, e.g., $e^{-\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j)}$, where $\mathrm{dist}(\cdot, \cdot)$
is some distance measure [2]. Taking $G$ to be the graph with adjacency matrix $\mathbf{A}$, the
association of the points in $\mathcal{X}$ to the sets $\mathcal{X}_l$ (outliers are typically
removed in a preprocessing step) is obtained by finding the connected components in $G$,
accomplished via singular value decomposition of the Laplacian of $G$ followed by k-means
clustering [2]. Whether an SC algorithm, or for that matter, any clustering algorithm, succeeds
depends on the number of subspaces $L$, their dimensions and relative orientations, and the number
of points in each subspace. Analytic results on the performance of SC methods are scarce. A notable
exception is the sparse subspace clustering (SSC) algorithm, recently introduced by Elhamifar and
Vidal [3], [4]. At the heart of this algorithm lies a clever construction of $\mathbf{A}$ that uses
ideas from sparse signal recovery. Soltanolkotabi and Candès [5] presented an elegant (geometric
functional) analysis of SSC and proved that SSC succeeds under very general conditions. Most
importantly, it is shown in [5], using a probabilistic analysis, that SSC succeeds even when the
subspaces $S_l$ intersect, which means the $S_l$ do not need to be independent or disjoint (see
footnote 1). Moreover, Soltanolkotabi and Candès [5] provide a clever extension of SSC that provably
detects outliers. To construct the adjacency matrix $\mathbf{A}$, SSC requires the solution of $N$
$\ell_1$-minimization problems, each in $N$ unknowns; this can pose significant computational
challenges for large data sets.

Contributions: We introduce an algorithm, termed thresholding based subspace clustering (TSC),
which applies spectral clustering to an adjacency matrix $\mathbf{A}$ obtained by thresholding
correlations between the data points in $\mathcal{X}$. TSC is shown to succeed even when the
subspaces intersect, and when their dimensions scale (up to a log-factor) linearly in the ambient
dimension. While SSC shares these desirable properties, TSC is computationally much less demanding,
as the construction of the adjacency matrix $\mathbf{A}$ in the TSC algorithm requires only the
computation of $N^2$ inner products followed by thresholding. Moreover, the performance analysis of
TSC, thanks to the algorithm's simplicity, does not need sophisticated mathematical tools; it is
based only on fairly standard concentration results for order statistics.

In practical applications the data points to be clustered are often subject to erasures, caused by, 
e.g., scratches on images. The literature is essentially void of corresponding analytic performance 
results. We prove that TSC succeeds even when the data points in X are subject to massive 
erasures. Specifically, the number of erasures is allowed to scale (up to a log-factor) linearly in 
the ambient dimension. We finally propose a simple scheme that provably detects outliers, and 
we corroborate our findings by numerical results. Proofs of the theorems in this paper, results on 
clustering noisy data points, and numerical results for real data sets are provided in [6]. 

We finally note that Lauer and Schnorr [7] also apply SC to an adjacency matrix constructed
from correlations between data points, albeit without thresholding. Moreover, no analytic
performance results are available for the algorithm in [7].

Notation: We use lowercase boldface letters to denote (column) vectors, e.g., $\mathbf{x}$, and
uppercase boldface letters to designate matrices, e.g., $\mathbf{A}$. For the vector $\mathbf{x}$,
$[\mathbf{x}]_q$ and $x_q$ denote the $q$th entry, and for the matrix $\mathbf{A}$, $A_{ij}$ stands
for the entry in the $i$th row and $j$th column. The spectral norm of $\mathbf{A}$ is
$\|\mathbf{A}\|_{2 \to 2} := \max_{\|\mathbf{v}\|_2 = 1} \|\mathbf{A}\mathbf{v}\|_2$, its Frobenius
norm is $\|\mathbf{A}\|_F := \sqrt{\sum_{i,j} |A_{ij}|^2}$, and $\mathbf{I}$ denotes the identity
matrix. The superscript $T$ stands for transposition and $\log(\cdot)$ for the natural logarithm.
The cardinality of the set $T$ is $|T|$. We write $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$
for a Gaussian random vector with mean $\boldsymbol{\mu}$ and covariance matrix
$\boldsymbol{\Sigma}$. The unit sphere in $\mathbb{R}^m$ is
$\mathbb{S}^{m-1} := \{\mathbf{x} \in \mathbb{R}^m : \|\mathbf{x}\|_2 = 1\}$.



2 The TSC algorithm 

The formulation introduced below assumes that outliers have already been removed from
$\mathcal{X}$, e.g., through the outlier detection scheme in Sec. 5. Given a set of data points
$\mathcal{X}$ (see footnote 2) and the parameter $q$ (the choice of $q$ is discussed below), the TSC
algorithm consists of the following steps:

Footnote 1: The linear subspaces $S_l$ are called disjoint if $S_l \cap S_k = \{0\}$ for all
$l \neq k$, and independent if $\dim(S_1 \oplus \ldots \oplus S_L) = \sum_l d_l$, where $\oplus$
stands for direct sum. An independent set of subspaces is disjoint, but the converse is not
necessarily true. Two subspaces are said to intersect if $S_l \cap S_k \neq \{0\}$.

Footnote 2: We assume the data points to be either normalized or to be of comparable norm. This
assumption is not restrictive as the data points can be normalized prior to clustering.
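Step 1: For every $\mathbf{x}_j \in \mathcal{X}$, identify the set
$T_j \subset \{1, \ldots, N\} \setminus \{j\}$ of cardinality $q$ such that

    $$|\langle \mathbf{x}_j, \mathbf{x}_i \rangle| \geq |\langle \mathbf{x}_j, \mathbf{x}_p \rangle| \quad \text{for all } i \in T_j \text{ and all } p \notin T_j \qquad (1)$$

and let $\mathbf{z}_j \in \mathbb{R}^N$ be the vector with $i$th entry
$|\langle \mathbf{x}_j, \mathbf{x}_i \rangle|$ if $i \in T_j$, and $0$ if $i \notin T_j$. Construct
the adjacency matrix $\mathbf{A}$ according to
$A_{ij} = |[\mathbf{z}_j]_i| + |[\mathbf{z}_i]_j|$.

Step 2: Estimate the number of subspaces using the eigengap heuristic [2] according to
$\hat{L} = \arg\max_{i = 1, \ldots, N-1} (\lambda_{i+1} - \lambda_i)$, where
$\lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_N$ are the eigenvalues of the normalized
Laplacian of the graph with adjacency matrix $\mathbf{A}$.

Step 3: Apply normalized SC [2] to $(\mathbf{A}, \hat{L})$.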
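The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the normalized SC step is assumed to be the variant that row-normalizes the eigenvectors of the symmetric normalized Laplacian, and k-means is a plain Lloyd iteration with a deterministic farthest-point initialization.

```python
import numpy as np

def tsc_adjacency(X, q):
    """Step 1: threshold correlations, keeping the q largest per point."""
    X = X / np.linalg.norm(X, axis=0, keepdims=True)  # columns are data points
    C = np.abs(X.T @ X)                 # all N^2 absolute inner products
    np.fill_diagonal(C, -1.0)           # exclude j itself from T_j
    N = C.shape[0]
    Z = np.zeros((N, N))
    for j in range(N):
        T_j = np.argsort(C[:, j])[-q:]  # indices of the q largest |<x_j, x_i>|
        Z[T_j, j] = C[T_j, j]           # column j holds the vector z_j
    return Z + Z.T                      # A_ij = [z_j]_i + [z_i]_j

def laplacian_eig(A):
    """Eigenvalues (ascending) and eigenvectors of I - D^{-1/2} A D^{-1/2}."""
    s = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    L_sym = np.eye(len(A)) - s[:, None] * A * s[None, :]
    return np.linalg.eigh(L_sym)

def estimate_num_subspaces(eigvals):
    """Step 2: eigengap heuristic, argmax_i (lambda_{i+1} - lambda_i)."""
    return int(np.argmax(np.diff(eigvals))) + 1

def kmeans(Y, k, n_iter=50):
    """Lloyd's algorithm with farthest-point initialization (deterministic)."""
    centers = [Y[0]]
    for _ in range(k - 1):
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Y[labels == c].mean(axis=0)
    return labels

def tsc(X, q):
    """Steps 1-3; returns (labels, estimated number of subspaces, A)."""
    A = tsc_adjacency(X, q)
    eigvals, eigvecs = laplacian_eig(A)
    L_hat = estimate_num_subspaces(eigvals)
    V = eigvecs[:, :L_hat]              # Step 3: assumed normalized SC variant
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    return kmeans(V, L_hat), L_hat, A
```

A call such as `labels, L_hat, A = tsc(X, q)` expects `X` with one data point per column. The explicit loop over columns keeps the correspondence with the sets $T_j$ visible; a vectorized partial sort would be the natural optimization for large $N$.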

TSC is said to succeed if the TSC subspace detection property according to the following 
definition holds. 

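Definition 1. The TSC subspace detection property holds for
$\mathcal{X} = \mathcal{X}_1 \cup \ldots \cup \mathcal{X}_L$ and adjacency matrix $\mathbf{A}$ if

i. $A_{ij} \neq 0$ only if $\mathbf{x}_i$ and $\mathbf{x}_j$ belong to the same set $\mathcal{X}_l$

and if

ii. for all $i = 1, \ldots, N$, $A_{ij} \neq 0$ for at least $q$ pairs $\mathbf{x}_i$ and
$\mathbf{x}_j$ that belong to the same set $\mathcal{X}_l$.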

The idea behind Def. 1, inspired by the $\ell_1$ subspace detection property introduced in [5], is
the following. If the TSC subspace detection property holds, then each node in the graph $G$ with
adjacency matrix $\mathbf{A}$ is connected to at least $q$ other nodes, all of which correspond to
points in the same subspace. In the SC step, the assignments of the points to clusters are then
determined through identification of the connected components of $G$. We will see in the numerical
results section that even if the TSC subspace detection property does not hold strictly, but the
$A_{ij}$ for pairs $\mathbf{x}_i, \mathbf{x}_j$ belonging to different subspaces are sufficiently
small, SC can still yield the correct result.
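Given ground-truth assignments, the two conditions of Def. 1 can be checked directly on an adjacency matrix. The following is a small sketch under the assumption that `labels[i]` holds the index of the set containing the $i$th point; the function name is hypothetical.

```python
import numpy as np

def tsc_detection_property(A, labels, q):
    """Return True if both conditions of the TSC subspace detection property hold."""
    A = np.asarray(A)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]   # same[i, j]: x_i, x_j in the same set
    nonzero = A != 0
    # Condition i: nonzero entries only between points of the same set.
    no_false_edges = not np.any(nonzero & ~same)
    # Condition ii: every node has at least q nonzero entries within its own set.
    off_diag = ~np.eye(len(A), dtype=bool)
    true_degree = (nonzero & same & off_diag).sum(axis=1)
    enough_true_edges = bool(np.all(true_degree >= q))
    return no_false_edges and enough_true_edges
```

The check uses exact comparison with zero, which matches the definition; for matrices produced by floating-point pipelines other than thresholding, a small tolerance may be more appropriate.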

Assumptions for performance analysis: For expositional convenience we take all subspaces
to have equal dimension $d$, and let the number of points in each of the subspaces be $n$ (i.e.,
$|\mathcal{X}_l| = n$, $l = 1, \ldots, L$).

Choice of q: Choosing $q$ too small/large will lead to over/under-estimation of the number of
subspaces $L$. A sensible choice is to take $q$ to be a fraction of $n$. This motivates setting
$q = n/p$, where $p > 1$. The results we obtain will ensure that, under certain conditions, the TSC
subspace detection property holds, provided that $p$ is not too small, while the specific choice of
$p$ will not matter. Moreover, numerical results in [6] indicate that TSC is not sensitive to the
specific choice of $q$.



3 Deterministic subspaces 



In order to understand the impact of the relative orientations of the subspaces on the performance 
of TSC, we take the subspaces to be deterministic and the points in the subspaces to be random. 



W.l.o.g. we represent the points in $S_l$ as
$\mathbf{x}_j^{(l)} = \mathbf{U}^{(l)} \mathbf{a}_j^{(l)}$, $j = 1, \ldots, n$, where
$\mathbf{a}_j^{(l)} \in \mathbb{R}^d$ and $\mathbf{U}^{(l)}$ is a basis for the $d$-dimensional
subspace $S_l$. We present two results that depend on different notions of affinity between
subspaces, namely

    $$\mathrm{aff}_p(S_k, S_l) := \left\| {\mathbf{U}^{(k)}}^T \mathbf{U}^{(l)} \right\|_{2 \to 2}$$

and [5, Def. 2.6]

    $$\mathrm{aff}(S_k, S_l) := \left\| {\mathbf{U}^{(k)}}^T \mathbf{U}^{(l)} \right\|_F / \sqrt{d},$$

both of which can be interpreted as measures of the relative orientations of the subspaces.
Throughout this section, we assume that the $\mathbf{U}^{(l)}$ are orthonormal bases, and hence
$0 \leq \mathrm{aff}(S_k, S_l) \leq \mathrm{aff}_p(S_k, S_l) \leq 1$. The relation between the two
affinity notions is brought out by noting that $\mathrm{aff}_p(S_k, S_l) = \cos(\theta_1)$ while
$\mathrm{aff}(S_k, S_l) = \sqrt{\cos^2(\theta_1) + \ldots + \cos^2(\theta_d)} / \sqrt{d}$, where
$\theta_1 \leq \ldots \leq \theta_d$ are the principal angles between $S_k$ and $S_l$.
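Both affinities follow from the singular values of ${\mathbf{U}^{(k)}}^T \mathbf{U}^{(l)}$, which are the cosines of the principal angles when the bases are orthonormal. The sketch below assumes that the first affinity is the spectral norm of this product (consistent with the relation to $\cos(\theta_1)$ stated in the text); the function name is hypothetical.

```python
import numpy as np

def affinities(U_k, U_l):
    """U_k, U_l: (m, d) matrices with orthonormal columns.

    Returns (aff_p, aff, cosines), where cosines are the cosines of the
    principal angles in decreasing order, i.e. cos(theta_1) first.
    """
    G = U_k.T @ U_l
    cosines = np.linalg.svd(G, compute_uv=False)
    d = U_k.shape[1]
    aff_p = cosines[0]                            # spectral norm = cos(theta_1)
    aff = np.linalg.norm(G, "fro") / np.sqrt(d)   # sqrt(sum of cos^2) / sqrt(d)
    return aff_p, aff, cosines

# Two planes in R^3 that share the first coordinate axis (they intersect).
U_k = np.eye(3)[:, [0, 1]]
U_l = np.eye(3)[:, [0, 2]]
aff_p, aff, cosines = affinities(U_k, U_l)
```

In this example the planes share a line, so the smallest principal angle is zero and `aff_p` equals 1, while `aff` equals $1/\sqrt{2}$ because the second principal angle is $\pi/2$.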

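Theorem 1. Suppose $n = pq$ data points are chosen in each of the $L$ subspaces at random according
to $\mathbf{x}_j^{(l)} = \mathbf{U}^{(l)} \mathbf{a}_j^{(l)}$, $j = 1, \ldots, n$, where the
$\mathbf{a}_j^{(l)}$ are i.i.d. $\mathcal{N}(\mathbf{0}, (1/d)\mathbf{I})$ and $p > 10/3$. If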

maxaffp(5fc,5/) < ci^=^^^^=, (2) 
k^l V log -i^ + log n 

then the TSC subspace detection property holds with probability at least 1 — Lne"^^"' — jjj^^j:^^ 
where $c_1$ and $c_2$ are absolute constants satisfying $0 < c_1, c_2 < 1$.

Thm. 1 states that TSC succeeds with high probability if
$\max_{k \neq l} \mathrm{aff}_p(S_k, S_l)$ is sufficiently small. Intuitively, we expect that
clustering becomes easier when the number of data points in each subspace increases. Thm. 1
confirms this intuition as, for fixed $d$, $q$, and $L$, the right hand side (RHS) of (2) increases
in $p$; moreover, the probability of success in Thm. 1 increases in $n$. If the number of
subspaces, $L$, increases, for fixed $d$, $q$ and $n$, clustering intuitively becomes harder and,
indeed, the RHS of (2) is seen to decrease in $L$. Note that Thm. 1 does not apply to subspaces
that intersect, as $\mathrm{aff}_p(S_k, S_l) = 1$ if $S_k$ and $S_l$ intersect and the RHS of (2) is
strictly smaller than 1. We next present a result analogous to Thm. 1 that applies to intersecting
subspaces.

Theorem 2. Suppose $n = pq$ data points are chosen in each of the $L$ subspaces at random according
to $\mathbf{x}_j^{(l)} = \mathbf{U}^{(l)} \mathbf{a}_j^{(l)}$, $j = 1, \ldots, n$, where the
$\mathbf{a}_j^{(l)}$ are i.i.d. uniform on $\mathbb{S}^{d-1}$ and $p > 6$. If

    $$\max_{k \neq l} \mathrm{aff}(S_k, S_l) \leq \frac{1}{13 \log N}, \qquad (3)$$

then the TSC subspace detection property holds with probability at least
$1 - 3/N - N e^{-c(n-1)}$, where $c$ is an absolute constant.

The interpretation of Thm. 2 is analogous to that of Thm. 1 with the important difference that
the RHS of (3), as opposed to the RHS of (2), decreases, albeit slowly, in $n$ (recall that
$N = Ln$). For SSC a result in the flavor of Thm. 2 was reported in [5, Thm. 2.8].



4 Erasures 



In practical applications the data points to be clustered are often corrupted by erasures, e.g.,
images that need to be clustered could exhibit scratches. Understanding the impact of erasures on
clustering performance is obviously of significant importance. The literature seems, however,
essentially void of corresponding analytic results. In the deterministic subspace setting such
results will necessarily depend on the specific orientations of the subspaces. In the following, we
therefore take both the orientations of the subspaces and the points in each subspace to be random.
Specifically, we take the entries of the $\mathbf{U}^{(l)} \in \mathbb{R}^{m \times d}$ to be i.i.d.
$\mathcal{N}(0, 1/m)$, which ensures that each of the $\mathbf{U}^{(l)}$ is approximately
orthonormal with high probability.
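The approximate orthonormality of such a Gaussian matrix is easy to observe numerically. The snippet below is only an illustration with arbitrarily chosen dimensions (`m = 2000`, `d = 10` are assumptions, not values from the paper); it measures how far $\mathbf{U}^T \mathbf{U}$ is from the identity in spectral norm.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 2000, 10                                        # illustrative sizes only
U = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))    # entries i.i.d. N(0, 1/m)

# Spectral-norm deviation of the Gram matrix from the identity.
deviation = np.linalg.norm(U.T @ U - np.eye(d), 2)
print(f"||U^T U - I||_(2->2) = {deviation:.3f}")
```

The deviation shrinks as $d/m$ decreases, which is the regime in which the columns of $\mathbf{U}$ behave like an orthonormal basis.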

Theorem 3. Suppose $n = pq$ data points are chosen in each of the $L$ subspaces at random according
to $\mathbf{x}_j^{(l)} = \mathbf{U}^{(l)} \mathbf{a}_j^{(l)}$, $j = 1, \ldots, n$, where the
$\mathbf{a}_j^{(l)}$ are i.i.d. $\mathcal{N}(\mathbf{0}, (1/d)\mathbf{I})$ and $p > 10/3$. Assume
that in each $\mathbf{x}_j$ up to $s$ entries (possibly different for each $\mathbf{x}_j$) are
erased, i.e., set to 0. Let the entries of each matrix
$\mathbf{U}^{(l)} \in \mathbb{R}^{m \times d}$ be i.i.d. $\mathcal{N}(0, 1/m)$. If

m > C2—j== dlog C4 f- +slog — +logL +cos, 

then the TSC subspace detection property holds with probability at least 1 — Lne~^^'^ — (^i_iyi^i — 
4g-cim,^ Co, ci, C2, C3, C4 > are absolute constants. 

Strikingly, Thm. 3 shows that the number of erasures is allowed to scale (up to a log-factor)
linearly in the ambient dimension.

For the fully random data model used in Thm. 3 we can furthermore conclude that TSC succeeds
with high probability even when the dimensions of the subspaces scale (up to a log-factor) linearly
in the ambient dimension. Drawing such a conclusion from Thm. 1 or Thm. 2 seems difficult as the
relation between $m$, $d$, and $L$ is implicit in the affinity measures. These findings should,
however, be taken with a grain of salt as the fully random subspace model ensures that the
subspaces are approximately disjoint with high probability. In the erasure-free case, i.e., for
$s = 0$, a result for SSC, analogous to Thm. 3, was reported in [5, Thm. 1.2].



5 Detection of outliers 

Outliers are data points that do not lie in one of the low-dimensional subspaces $S_l$ and have no
low-dimensional linear structure. Here, this is modeled by assuming random outliers distributed
uniformly on the unit sphere of $\mathbb{R}^m$. The outlier detection criterion we employ does not
need knowledge of the number of outliers $N_0$ and is based on the following observation. The
maximum inner product between an outlier and any other point in $\mathcal{X}$ is, with high
probability, smaller than $c \sqrt{\log N} / \sqrt{m}$. We therefore classify $\mathbf{x}_j$ as an
outlier if
$\max_{p \neq j} |\langle \mathbf{x}_p, \mathbf{x}_j \rangle| < c \sqrt{\log N} / \sqrt{m}$. The
maximum inner product between any point $\mathbf{x}_j \in \mathcal{X}_l$ and the points in
$\mathcal{X}_l \setminus \{\mathbf{x}_j\}$ is unlikely to be smaller than $1/\sqrt{d}$. Hence an
inlier is unlikely to be misclassified as an outlier if $d/m$ is sufficiently small.

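Theorem 4. Suppose $n = pq$ data points are chosen in each of the $L$ subspaces at random according
to $\mathbf{x}_j^{(l)} = \mathbf{U}^{(l)} \mathbf{a}_j^{(l)}$, $j = 1, \ldots, n$, where the
$\mathbf{a}_j^{(l)}$ are i.i.d. uniform on $\mathbb{S}^{d-1}$ and each $\mathbf{U}^{(l)}$ is
orthonormal. Let the $N_0$ outliers be i.i.d. uniform on $\mathbb{S}^{m-1}$. Declare
$\mathbf{x}_j \in \mathcal{X}$ to be an outlier if
$\max_{p \neq j} |\langle \mathbf{x}_p, \mathbf{x}_j \rangle| < \sqrt{6 \log N} / \sqrt{m}$. Then,
with $N = Ln + N_0$, provided that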



    $$\frac{d}{m} \leq \frac{1}{6 \log N}, \qquad (4)$$

with probability at least 1 — 2Nq/N'^ — nLe^^°^^'^^'^^^^~^\ every outlier is detected and no point in a 
subspace is misclassified as an outlier. 
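The detection rule amounts to one pass over the matrix of absolute inner products. The sketch below assumes the threshold $\sqrt{6 \log N} / \sqrt{m}$ and unit-norm data points (columns are normalized inside the function as a safeguard); the function name is hypothetical.

```python
import numpy as np

def detect_outliers(X):
    """Flag column j of X as an outlier if its largest absolute inner product
    with any other column is below sqrt(6 log N) / sqrt(m)."""
    m, N = X.shape
    X = X / np.linalg.norm(X, axis=0, keepdims=True)  # enforce unit norm
    C = np.abs(X.T @ X)
    np.fill_diagonal(C, 0.0)                          # ignore <x_j, x_j>
    threshold = np.sqrt(6.0 * np.log(N)) / np.sqrt(m)
    return C.max(axis=1) < threshold                  # boolean mask, True = outlier
```

Points not flagged by the returned mask would then be passed on to the clustering steps of Sec. 2.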
Figure 1: Errors as a function of the dimension of the subspaces, d, on the vertical and p on the
horizontal axis. 



Since (4) can be rewritten as $N_0 \leq e^{m/(6d)} - Ln$, we can conclude that outlier detection
succeeds even if the number of outliers scales exponentially in $m/d$, i.e., if $d$ is kept
constant, exponentially in the ambient dimension! Note that this result does not make any
assumptions on the orientations of the subspaces $S_l$. The outlier detection scheme proposed in
[5] identifies outliers under a very similar condition. However, it requires the solution of $N$
$\ell_1$-minimization problems, each in $N$ unknowns, while the algorithm proposed here only needs
to compute $N^2$ inner products followed by thresholding.



6 Numerical results 

We use the performance measures employed in O [8]. The clustering error (CE) is defined as 
the ratio between the number of misclassified data points and the total number of points in X. 
The error in estimating the number of subspaces $L$ is denoted as EL and takes on the value 0
if the estimate is correct, else it is equal to 1. The feature detection error (FDE) is defined as
$\frac{1}{N} \sum_{i=1}^{N} \left(1 - \|\mathbf{b}_{\mathbf{x}_i}\|_2 / \|\mathbf{b}_i\|_2\right)$,
where $\mathbf{b}_i$ is the $i$th column of the adjacency matrix $\mathbf{A}$ and
$\mathbf{b}_{\mathbf{x}_i}$ is the vector containing the entries of $\mathbf{b}_i$ corresponding to
the subspace $\mathbf{x}_i$ lives in. The FDE measures to which extent points from different
subspaces are connected in $G$ and is equal to zero if the TSC subspace detection property holds.
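The three measures can be computed as follows. This is a sketch with hypothetical function names; the CE is taken as the misclassification ratio under the best relabeling of the estimated clusters, found here by brute force over permutations, which is an assumption about how label ambiguity is resolved and is only practical for a small number of clusters.

```python
import numpy as np
from itertools import permutations

def clustering_error(true_labels, est_labels):
    """Fraction of misclassified points, minimized over cluster relabelings."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    t_ids, e_ids = np.unique(true_labels), np.unique(est_labels)
    k = max(len(t_ids), len(e_ids))
    # confusion[a, b]: points in estimated cluster a and true cluster b
    confusion = np.zeros((k, k), dtype=int)
    for a, e in enumerate(e_ids):
        for b, t in enumerate(t_ids):
            confusion[a, b] = np.sum((est_labels == e) & (true_labels == t))
    best = max(sum(confusion[a, perm[a]] for a in range(k))
               for perm in permutations(range(k)))
    return 1.0 - best / len(true_labels)

def el_error(L_true, L_hat):
    """EL: 0 if the number of subspaces is estimated correctly, else 1."""
    return int(L_hat != L_true)

def feature_detection_error(A, true_labels):
    """FDE: mean over i of 1 - ||b_{x_i}||_2 / ||b_i||_2 (no zero columns assumed)."""
    A = np.asarray(A, dtype=float)
    true_labels = np.asarray(true_labels)
    same = true_labels[:, None] == true_labels[None, :]
    full = np.linalg.norm(A, axis=0)                        # ||b_i||_2
    within = np.linalg.norm(np.where(same, A, 0.0), axis=0)  # ||b_{x_i}||_2
    return float(np.mean(1.0 - within / full))
```

For larger numbers of clusters, the permutation search would typically be replaced by an assignment solver.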

Influence of d, p, and erasures: We generate $L = 15$ subspaces in $\mathbb{R}^m$ at random, by
choosing the corresponding $\mathbf{U}^{(l)}$ uniformly at random from the set of orthonormal
matrices in $\mathbb{R}^{m \times d}$, and vary the number of points $n = dp$ in each subspace. The
points in the subspaces are chosen at random according to the probabilistic model in Thm. 3. The
results depicted in Fig. 1 show, as indicated in Sec. 2, that TSC can, indeed, succeed even when
the TSC subspace detection property does not hold. Finally, we perform the same experiment, but
erase the entries of $\mathbf{x}_j$ with indices in $D_j$, where $D_j$ is chosen independently for
each $\mathbf{x}_j$ and uniformly from $\{D \subset \{1, \ldots, m\} : |D| = s\}$. The results
summarized in Fig. 2 show that TSC succeeds, even when a large fraction of the entries is erased.
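A data set of this kind can be generated along the following lines. The sketch is an assumption about the setup rather than the authors' code: orthonormal bases are obtained from the QR decomposition of a Gaussian matrix (whose column span is uniformly distributed), coefficients follow the Gaussian model of Thm. 3, and the function names are hypothetical.

```python
import numpy as np

def random_subspace_points(m, d, n, L, rng):
    """Draw n points from each of L random d-dimensional subspaces of R^m."""
    X, labels = [], []
    for l in range(L):
        U, _ = np.linalg.qr(rng.standard_normal((m, d)))       # orthonormal basis
        a = rng.normal(scale=1.0 / np.sqrt(d), size=(d, n))     # a_j ~ N(0, (1/d) I)
        X.append(U @ a)
        labels += [l] * n
    return np.hstack(X), np.array(labels)

def erase(X, s, rng):
    """Set s entries of every column to zero; the index set D_j is drawn
    uniformly among all size-s subsets, independently for each column."""
    X = X.copy()
    m, N = X.shape
    for j in range(N):
        D_j = rng.choice(m, size=s, replace=False)
        X[D_j, j] = 0.0
    return X
```

A typical use would be `X, labels = random_subspace_points(m, d, n, L, rng)` followed by `X_erased = erase(X, s, rng)` with `rng = np.random.default_rng(seed)`, after which the erased points are clustered and compared against `labels`.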

Detection of outliers: In order to allow for a comparison with the outlier detection scheme
proposed in [5], we perform our experiment with the same parameters as used in [5, Sec. 5.2].
Specifically, we set $d = 5$, vary $m \in \{50, 100, 200\}$, and generate $L = 2m/d$ subspaces and
$n = 5d$ points in each subspace at random as in the previous paragraph. Each of the $N_0 = Ln$
outliers is chosen i.i.d. uniformly on $\mathbb{S}^{m-1}$. Note that we have as many outliers as
inliers. We find a
misclassification error probability of {0.017, 1.510"'^, 2. 510~^} for m = {50,100,200}, respectively. 
Figure 2: CE as a function of the dimension of the subspaces, d, on the vertical and p on the
horizontal axis (panels: s = 0, s = 5, s = 10, s = 15).

Similar performance was reported for the scheme proposed in [5]. 